Abstract

Visual features are of vital importance for human 
action understanding in videos. This paper presents a 
new video representation, called trajectory-pooled deep- 
convolutional descriptor (TDD), which shares the merits of 
both hand-crafted features [31] and deep-learned features 
[24]. Specifically, we utilize deep architectures to learn 
discriminative convolutional feature maps, and conduct 
trajectory-constrained pooling to aggregate these convolutional
features into effective descriptors. To enhance
the robustness of TDDs, we design two normalization 
methods to transform convolutional feature maps, namely 
spatiotemporal normalization and channel normalization. 
The advantages of our features are twofold: (i) TDDs
are automatically learned and have higher discriminative
capacity than hand-crafted features; (ii)
TDDs take account of the intrinsic characteristics of the
temporal dimension and introduce the strategies of
trajectory-constrained sampling and pooling for aggregating
deep-learned features. We conduct experiments on two
challenging datasets: HMDB51 and UCF101. Experimental
results show that TDDs outperform previous hand-crafted
features [31] and deep-learned features [24]. Our method
also achieves superior performance to the state of the art
on these datasets.¹


1. Introduction 

Human action recognition [1, 24, 31, 35, 36] in videos
attracts increasing research interest in the computer vision
community due to its potential applications in video
surveillance, human-computer interaction, and video content
analysis. However, action recognition remains a difficult
problem when focusing on realistic datasets collected
from movies [17], web videos [15, 26], and TV shows
[20]. There are large intra-class variations in the same

¹ The TDD code and learned two-stream ConvNet models are available
at https://wanglimin.github.io/tdd/index.html



Figure 1. There are mainly two types of features in action
recognition: hand-crafted features and deep-learned features. For
hand-crafted features, improved trajectories [31] combined with
Fisher vector are most successful. For deep-learned features,
Convolutional Networks (ConvNets) [18] are popular deep
architectures, which contain a sequence of convolutional and pooling
layers. They aim to automatically learn features with a deep
discriminatively trained neural network.


action class, which may be caused by background clutter,
viewpoint change, and various motion speeds and styles.
Meanwhile, the high dimension and low resolution of
video further increase the difficulty of designing efficient and
robust recognition methods. Visual representations from
action videos are crucial for dealing with these issues and
designing effective recognition systems. Currently, there
are mainly two types of video features available for action
recognition, as illustrated in Figure 1.

The first type of representation is hand-crafted
local features; typical examples include Space
Time Interest Points [16], Cuboids [7], Dense Trajectories
[30], and Improved Trajectories [31]. The calculation of these
local features can usually be decomposed into two phases:
a detector, which aims to discover the salient and informative
regions for action understanding, and a descriptor, whose
goal is to describe the visual patterns of the extracted regions.
Among these local features, improved trajectories with
rich descriptors of HOG, HOF, and MBH have been shown to
be successful on a number of challenging datasets (e.g.,
HMDB51 [15], UCF101 [26]) and contests (e.g., THUMOS
[11]). Improved trajectories include several important
ingredients in their extraction process. Firstly, the extracted
trajectories are mainly located at regions with high motion
salience, which contain rich and discriminative information
for action recognition. Secondly, the local descriptors of
the corresponding regions in several successive frames are
aligned and pooled along the trajectories. This
trajectory-constrained sampling strategy also takes account of the
temporal continuity of human action, and is effective
in dealing with variations of motion speed. However,
these hand-crafted descriptors are not optimized for visual
representation and may lack discriminative capacity for
action recognition.

The second type of representation is deep-learned
features; typical methods include Convolutional RBMs
[29], 3D ConvNets [9], Deep ConvNets [12], and Two-Stream
ConvNets [24]. These deep learning methods aim
to automatically learn the semantic representation from
raw video by using a deep neural network discriminatively
trained on a large amount of labeled data. Two-Stream
ConvNets [24] are probably the most successful
architecture at present, and they match the state-of-the-art
performance of improved trajectories [31, 32] on UCF101
and HMDB51. They are composed of two neural networks,
namely spatial nets and temporal nets. Spatial nets mainly
capture the discriminative appearance features for action
understanding, while temporal nets aim to learn effective
motion features. However, unlike image classification
tasks [14], these deep learning based methods fail to
outperform previous hand-crafted features. One problem of
deep learning methods is that they require a large number of
labeled videos for training, while most available datasets are
relatively small. Meanwhile, most current deep learning
based action recognition methods largely ignore the intrinsic
difference between the temporal domain and the spatial domain,
and just treat the temporal dimension as feature channels when
adapting the architectures of ConvNets to model videos.

Motivated by the above analysis, this paper proposes a 
new kind of video feature, called trajectory-pooled deep- 
convolutional descriptor (TDD). The design of TDD aims 
to combine the benefits of both hand-crafted and deep- 
learned features. To achieve this goal, our approach
integrates the key factors from two successful video
representations, namely improved trajectories [31] and two-stream
ConvNets [24]. We utilize deep architectures to learn
multi-scale convolutional feature maps, and introduce the
strategies of trajectory-constrained sampling and pooling to
encode deep features into effective descriptors.

Specifically, we first train two-stream ConvNets on a
relatively large dataset, since more labeled action videos
make ConvNet training more stable and robust. Then,
we treat the learned two-stream ConvNets as generic feature
extractors, and use them to obtain multi-scale convolutional
feature maps for each video. Meanwhile, we detect a
set of point trajectories with the method of improved
trajectories. Based on convolutional feature maps and
improved trajectories, we pool the local ConvNet responses
over the spatiotemporal tubes centered at the trajectories;
the resulting descriptor is called TDD. Finally, we
choose Fisher vector representation to aggregate these local
TDDs over the whole video into a global super vector,
and use a linear SVM as the classifier to perform action
recognition. We conduct experiments on two public action
datasets: the HMDB51 dataset [15] and the UCF101 dataset
[26]. We show that our TDDs obtain state-of-the-art
performance for action recognition on these challenging
datasets. Meanwhile, our results demonstrate that our
TDDs are complementary to hand-crafted features
(HOG, HOF, and MBH) and that fusing them
further boosts the recognition performance.

2. Related Works 

Hand-crafted features. Local features [7, 16, 33, 39]
have become popular and effective representations in action
recognition, as these local features do not require
algorithms to detect the human body and are robust to background
clutter, illumination changes, and video noise. Space Time
Interest Points [16] proposed the Harris3D detector to extract
informative regions, while the Cuboid [7] detector relied on
temporal Gabor filters. Willems et al. [39] proposed a
Hessian detector, which is a spatio-temporal extension of the
Hessian saliency measure used for blob detection in images.
Meanwhile, several local descriptors have been proposed to
represent the 3D volumes extracted around these interest
points, such as Histogram of Gradient (HOG), Histogram
of Optical Flow (HOF) [17], 3D Histogram of Gradient
(HOG3D) [13], and Extended SURF (ESURF) [39]. Recent
works made use of point trajectories [30, 31] to extract
and align 3D volumes, and resorted to richer low-level
descriptors for constructing effective video representations,
including HOG, HOF, and Motion Boundary Histogram
(MBH).

One limitation of these local features is that they lack
semantics and discriminative capacity. To overcome this
issue, several mid-level and high-level video representations
have been proposed such as Action Bank [22], Dynamic-Poselets
[37], Motionlets [35], Motion Atoms and Phrases
[34], and Actons [42]. They usually resorted to some
heuristic mining methods to select discriminative visual
elements as feature units. Instead, this paper takes a different
view of this problem and replaces these local hand-crafted
descriptors with deep-learned representations. Our deep
representations deliver high-level semantic information, and
are learned automatically from training data without using
these heuristic rules.

Figure 2. Pipeline of TDD. The whole process of extracting TDD is composed of three steps: (i) extracting trajectories, (ii) extracting
multi-scale convolutional feature maps, and (iii) calculating TDD. We effectively exploit two available state-of-the-art video representations,
namely improved trajectories and two-stream ConvNets. Grounded on them, we conduct trajectory-constrained sampling and pooling over
convolutional feature maps to obtain trajectory-pooled deep convolutional descriptors.


Deep-learned features. Deep learning techniques have
achieved great success in image based tasks [14, 25, 28, 41]
and there have been a number of attempts to develop
deep architectures for video action recognition [9, 12, 24,
29]. Taylor et al. [29] used Gated Restricted Boltzmann
Machines (GRBMs) to learn the motion features in an
unsupervised manner and then resorted to convolutional
learning to fine-tune the parameters. Ji et al. [9] extended
2D ConvNets to the video domain for action recognition on
relatively small datasets, and recently Karpathy et al. [12]
tested ConvNets with deep structures on a large dataset,
called Sports-1M. However, these deep models achieved
lower performance than shallow hand-crafted
representations [31], which might be ascribed to two facts:
firstly, available action datasets are relatively small for
deep learning; secondly, learning complex motion patterns
is more challenging. Simonyan et al. [24] designed
two-stream ConvNets containing a spatial and a temporal net
by exploiting the large ImageNet dataset for pre-training and
explicitly calculating optical flow for capturing motion
information, and finally matched the state-of-the-art
performance.


However, these deep models lacked consideration of the
temporal characteristics of video data and relied on large
training datasets. We incorporate video temporal
characteristics into deep architectures by using the strategy of
trajectory-constrained sampling and pooling, and propose
a new descriptor. Meanwhile, our descriptors can be easily
adapted to datasets of smaller size.


3. Improved Trajectories Revisited 

As shown in Figure 2, our proposed representation
(TDD) is based on low-level trajectory extraction, for which we
choose improved trajectories [31]. In this section, we briefly
review the extraction process of improved trajectories. It is
worth noting that our TDD is independent of the method
of extracting trajectories, and we use improved trajectories
due to their good performance.

Improved trajectories are extended from dense trajectories
[30]. To compute dense trajectories, the first step is to
densely sample a set of points on 8 spatial scales on a grid
with a step size of 5 pixels. Points in homogeneous areas are
eliminated by setting a threshold for the smaller eigenvalue
of their autocorrelation matrices. Then these sampled points
are tracked by median filtering of the dense flow field:

$$P_{t+1} = (x_{t+1}, y_{t+1}) = (x_t, y_t) + (\mathcal{M} * \omega_t)|_{(\bar{x}_t, \bar{y}_t)}, \tag{1}$$

where $\mathcal{M}$ is the median filter kernel, $*$ is the convolution
operation, $\omega_t = (u_t, v_t)$ is the dense optical flow field of the
$t$-th frame, and $(\bar{x}_t, \bar{y}_t)$ is the rounded position of $(x_t, y_t)$.

To avoid the drifting problem of tracking, the maximum
length of a trajectory is set to 15 frames. Finally, static
trajectories are removed as they lack motion information,
and trajectories with sudden large displacements are
also discarded, since they are most likely incorrect due to
inaccurate optical flow.
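As a rough illustration of Equation (1) and the filtering rules above, the following sketch tracks one point through a sequence of flow fields using only the Python standard library. The function names, the 3 x 3 median window, the component-wise median, and the thresholds in `is_valid` are assumptions made for this example; they are not taken from the dense or improved trajectory implementations.

```python
from statistics import median


def median_filtered_flow(flow, x, y, radius=1):
    """Component-wise median of the flow in a window around pixel (x, y).

    `flow` is a list of rows; each entry is a (u, v) displacement.
    The window is clipped at the image border.
    """
    height, width = len(flow), len(flow[0])
    us, vs = [], []
    for yy in range(max(0, y - radius), min(height, y + radius + 1)):
        for xx in range(max(0, x - radius), min(width, x + radius + 1)):
            u, v = flow[yy][xx]
            us.append(u)
            vs.append(v)
    return median(us), median(vs)


def track_point(flows, start, max_points=15):
    """Follow one point through consecutive flow fields (Equation 1)."""
    height, width = len(flows[0]), len(flows[0][0])
    x, y = start
    trajectory = [(x, y)]
    for flow in flows[:max_points - 1]:
        xr, yr = round(x), round(y)  # rounded position used for the lookup
        if not (0 <= xr < width and 0 <= yr < height):
            break  # the point has left the frame
        u, v = median_filtered_flow(flow, xr, yr)
        x, y = x + u, y + v
        trajectory.append((x, y))
    return trajectory


def is_valid(trajectory, min_total=1.0, max_step_ratio=0.7):
    """Reject static trajectories and ones with a sudden large jump.

    Both thresholds are hypothetical values chosen for illustration.
    """
    steps = [((x1 - x0) ** 2 + (y1 - y0) ** 2) ** 0.5
             for (x0, y0), (x1, y1) in zip(trajectory, trajectory[1:])]
    total = sum(steps)
    if total < min_total:
        return False  # static trajectory
    return max(steps) <= max_step_ratio * total
```

Capping the trajectory at `max_points` mirrors the 15-frame limit described above; a real implementation would restart tracking from freshly sampled points once a trajectory ends.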

Improved trajectories boost the recognition performance
of dense trajectories by taking camera motion into account.
The method assumes that the background motion between two consecutive
frames can be characterized by a homography matrix. To
estimate the homography matrix, the first step is to find
the correspondence between two consecutive frames. The authors
resort to SURF [2] feature matching and optical flow
based matching, as these two kinds of matching scheme
are complementary to each other. Then, they use the
RANSAC [8] algorithm to estimate the homography matrix.
Based on the homography, they rectify the frame image to
remove the camera motion and re-calculate the optical flow,
called warped flow. Warped flow benefits the
descriptors calculated from optical flow, in particular
HOF, and trajectories corresponding to camera motion can
be removed too.

We adopt improved trajectories for the task of TDD
extraction, but make one modification. Unlike dense
trajectories or improved trajectories, we only track points on the
original spatial scale, and extract multi-scale TDDs around
the extracted trajectories (see Section 4). We observe that
tracking on a single scale is fast to implement. In
summary, given a video $V$, we obtain a set of trajectories

$$\mathbb{T}(V) = \{T_1, T_2, \cdots, T_K\}, \tag{2}$$

where $K$ is the number of trajectories, and $T_k$ denotes the
$k$-th trajectory in the original spatial scale:

$$T_k = \{(x_1^k, y_1^k, z_1^k), (x_2^k, y_2^k, z_2^k), \cdots, (x_P^k, y_P^k, z_P^k)\}, \tag{3}$$

where $(x_p^k, y_p^k, z_p^k)$ is the pixel position of the $p$-th point in
trajectory $T_k$ and $P$ is the length of the trajectory ($P = 15$).
These trajectories will be used for trajectory-constrained
sampling and pooling in the process of TDD extraction, as
described in the next section.

4. Deep Convolutional Descriptors 

In this section, we describe a new video representation,
called trajectory-pooled deep-convolutional descriptor
(TDD), which shares the benefits of both hand-crafted
and deep-learned features. We first introduce the
architectures of the convolutional networks (ConvNets) we used.
Then, we show how to adapt the ConvNets trained on large
datasets to extract multi-scale convolutional feature maps.
Finally, based on improved trajectories and convolutional
feature maps, we describe the details of how to calculate
TDDs.

4.1. Convolutional networks 

Our TDD starts with designing deep ConvNets for 
extracting convolutional feature maps. In principle, any 
kind of ConvNet architecture can be adopted for TDD 
extraction. In our implementation, we choose the two-stream
ConvNets [24] due to their good performance on the
UCF101 and HMDB51 datasets.

The two-stream ConvNets contain two separate ConvNets,
namely spatial nets and temporal nets. Spatial nets
are designed for capturing static appearance cues and
are trained on single frame images (224 x 224 x 3),
while temporal nets aim to describe the dynamic motion
information, and their inputs are volumes of stacked optical
flow fields (224 x 224 x 2F, where F is the number of stacked
flows). Meanwhile, decoupling the spatial and temporal
nets also makes it possible to exploit the available images by
pre-training spatial nets on the ImageNet challenge dataset [6],
and to explicitly handle motion information with optical flow
algorithms for temporal nets.

The details of the ConvNets are shown in Table 1. This
ConvNet architecture originates from the Clarifai networks
[41] and is adapted to the task of action recognition with fewer
filters in the conv4 layer and a lower-dimensional full7 layer. We
make one small modification: we use the same network
architecture for both the spatial and temporal nets except for
the input data layer, while the original two-stream ConvNets
[24] omit the second local response normalization (LRN)
layer in the temporal net due to memory consumption.
The implementation and training details can be
found in Section 5.

4.2. Convolutional feature maps 

Once the training of two-stream ConvNets is complete,
we treat them as generic feature extractors to obtain the
convolutional feature maps of videos. In general, for each
video, we obtain the feature maps of the spatial and temporal
nets in a frame-by-frame and volume-by-volume manner,
respectively. In order to give the feature maps the same
temporal duration as the input video, we pad the optical flow
fields at the beginning with F - 1 copies of the optical flow
field from the first frame, where F is the number of stacked
optical flow fields.
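The padding scheme can be sketched as follows. This is a minimal illustration, assuming that each temporal-net input is the window of F flow fields ending at the current frame; the window alignment and the function name `stack_flow_volumes` are assumptions of this example, not details given in the text.

```python
def stack_flow_volumes(flows, num_stack):
    """Return one stacked volume per input flow field.

    `flows` holds L per-frame flow fields (any objects). The list is
    padded at the front with num_stack - 1 copies of the first field,
    so that exactly L sliding windows of length num_stack exist.
    """
    if not flows:
        return []
    padded = [flows[0]] * (num_stack - 1) + list(flows)
    return [padded[i:i + num_stack] for i in range(len(flows))]
```

With four flow fields and F = 3, the first volume consists of three copies of the first field and the last volume holds the final three fields, so the number of volumes equals the number of flow fields.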

Each frame or volume is taken as the input of the spatial
or temporal net. We make two modifications to the
spatial and temporal nets. The first one is that we remove
the layers after the target layer for feature extraction. For
example, to extract feature maps of conv4, we remove
the layers from conv5 to full8. Therefore, the outputs of the
spatial and temporal nets are the convolutional feature
maps, which will be used for extracting TDD in the next
subsection.

The second modification is that before each convolutional
or pooling layer with kernel size $k$, we conduct
zero padding of the layer's input with size $\lfloor k/2 \rfloor$. This
padding allows the input and output maps of these layers
to have the same spatial extent. With this padding, it is
straightforward to map the positions of trajectory points
in the video to the coordinates of the convolutional feature maps.
A trajectory point with video coordinates $(x_p, y_p, z_p)$ in
Equation (3) will be centered on $(r \times x_p, r \times y_p, z_p)$ in
the convolutional map, where $r$ is the map size ratio with respect
to the input size, as listed in Table 1.
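To make the effect of this padding concrete, the sketch below computes layer output sizes and maps a video coordinate onto a feature map. It assumes the common floor convention for output sizes and round-half-up for non-negative coordinates; neither convention is stated in the text, and all function names are hypothetical.

```python
def same_padding(kernel_size):
    """Zero padding of floor(k / 2) on each side, as described above."""
    return kernel_size // 2


def output_size(input_size, kernel_size, stride):
    """Output length of a padded layer (floor convention, an assumption)."""
    pad = same_padding(kernel_size)
    return (input_size + 2 * pad - kernel_size) // stride + 1


def to_map_coords(x, y, z, ratio):
    """Map video coordinates (x, y, z) to feature-map coordinates.

    Only the spatial coordinates are scaled by the map size ratio;
    the frame index z is unchanged. Assumes x, y >= 0.
    """
    return int(ratio * x + 0.5), int(ratio * y + 0.5), z
```

Under these assumptions, a 224-pixel input passed through the kernel sizes and strides of Table 1 (conv1 to pool5) gives side lengths 112, 56, 28, 14, 14, 14, 14, and 7, which agrees with the listed ratios of 1/2 to 1/32.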

ConvNets are bottom-up architectures with a sequence 
of alternating convolutional and pooling layers. Different 


Layer            conv1  pool1  conv2  pool2  conv3  conv4    conv5    pool5    full6  full7  full8
size             7x7    3x3    5x5    3x3    3x3    3x3      3x3      3x3      -      -      -
stride           2      2      2      2      1      1        1        2        -      -      -
channel          96     96     256    256    512    512      512      512      4096   2048   101
map size ratio   1/2    1/4    1/8    1/16   1/16   1/16     1/16     1/32     -      -      -
receptive field  7x7    11x11  27x27  43x43  75x75  107x107  139x139  171x171  -      -      -

Table 1. ConvNet architectures. We use similar architectures to two-stream ConvNets [24], which are adapted to the task of action
recognition from the Clarifai networks [41], with fewer filters in the conv4 layer (512 vs. 1024) and a lower-dimensional full7 layer (2048 vs.
4096). For the conv1 and conv2 layers, local response normalization (LRN) is applied with parameter settings $n = 5$, $\alpha = 5 \times 10^{-4}$, $\beta = 0.75$.
The full6 and full7 layers are regularized using dropout, and the full8 layer acts as a soft-max classifier. The activation function
for all weight layers is the rectified linear unit (RELU). The size ratios of feature maps with respect to input data range from 1/2 to
1/32, and the feature receptive fields vary from 7 x 7 to 171 x 171, for different convolutional and pooling layers (conv1 to pool5).


layers of ConvNets have various receptive fields as shown
in Table 1, ranging from 7 x 7 to 171 x 171. As described
in [41], these different layers capture patterns from
simple visual elements such as edges, to complex visual
concepts such as parts and objects. The higher layers
have larger receptive fields and obtain more invariant and
discriminative features. Intuitively, these different layers
describe the visual content at different levels, and they
are complementary to each other for the task of recognition.
We will exploit this complementary property of different
layers during the extraction of TDD. Given a video $V$, we
obtain a set of convolutional feature maps:

$$\mathbb{C}(V) = \{C_1^s, C_2^s, \cdots, C_M^s, C_1^t, C_2^t, \cdots, C_M^t\}, \tag{4}$$

where $C_m^s \in \mathbb{R}^{H_m \times W_m \times L \times N_m}$ is the $m$-th feature map
of the spatial net, $H_m$ is its height, $W_m$ is its width, $L$ is the
video duration, and $N_m$ is the number of channels. $C_m^t \in
\mathbb{R}^{H_m \times W_m \times L \times N_m}$ is the $m$-th feature map of the temporal net, and
$M$ is the number of layers for extracting TDD.
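The receptive field and map size ratio columns of Table 1 can be reproduced from the kernel sizes and strides alone. The sketch below uses the standard recurrence for stacked layers (the receptive field grows by (k - 1) times the accumulated stride); the layer list is copied from Table 1, while the function and variable names are assumptions of this example.

```python
from fractions import Fraction

# (name, kernel size, stride) for conv1 to pool5, as listed in Table 1.
LAYERS = [
    ("conv1", 7, 2), ("pool1", 3, 2), ("conv2", 5, 2), ("pool2", 3, 2),
    ("conv3", 3, 1), ("conv4", 3, 1), ("conv5", 3, 1), ("pool5", 3, 2),
]


def receptive_fields(layers):
    """Return (name, receptive field side, map size ratio) per layer."""
    field, jump = 1, 1  # jump = accumulated stride in input pixels
    rows = []
    for name, kernel, stride in layers:
        field += (kernel - 1) * jump
        jump *= stride
        rows.append((name, field, Fraction(1, jump)))
    return rows


for name, field, ratio in receptive_fields(LAYERS):
    print(f"{name}: {field} x {field}, ratio {ratio}")
```

The recurrence yields receptive fields of 7, 11, 27, 43, 75, 107, 139, and 171 pixels, matching the table.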

4.3. Trajectory-pooled descriptors 

We now describe the method for extracting trajectory-pooled
deep-convolutional descriptors (TDDs) from a set
of improved trajectories $\mathbb{T}(V)$ and convolutional feature
maps $\mathbb{C}(V)$ for a given video $V$. In essence, TDD
is a kind of local trajectory-aligned descriptor computed
in a 3D volume around the trajectory. TDDs from the
spatial and temporal nets capture the appearance and motion
information of this 3D volume, respectively. The size of
the volume is N x N pixels and P frames, where N is
the receptive field size and P is the trajectory length. The
extraction of TDD is composed of two steps: feature map
normalization and trajectory pooling.

Normalization proves to be an effective strategy in
designing features, partially because it can reduce the influence
of illumination. It has been widely exploited in local
descriptors such as SIFT [19], HOG [5], and HOF [17], and
in deep learning such as local response normalization [14].
We apply the normalization strategy to the convolutional
feature maps of two-stream ConvNets to suppress the
activation burstiness of some neurons. We design two kinds
of normalization methods:

• Spatiotemporal Normalization. For spatiotemporal
normalization, we normalize the feature map for each
channel independently across the video spatiotemporal
extent. Given a feature map $C \in \mathbb{R}^{H \times W \times L \times N}$ of
Equation (4), we normalize the convolutional feature
value as follows:

$$\tilde{C}_{st}(x, y, z, n) = C(x, y, z, n) / \max V_{st}^{n}, \tag{5}$$

where $\max V_{st}^{n}$ is the maximum value of the $n$-th feature
map over the whole video spatiotemporal extent,
which means $\max V_{st}^{n} = \max_{x,y,z} C(x, y, z, n)$.
The spatiotemporal normalization method ensures that
each convolutional feature channel ranges in the same
interval, and thus contributes equally to the final TDD
recognition performance.

• Channel Normalization. For channel normalization,
we normalize the feature map for each pixel
independently across the feature channels. We conduct
channel normalization for feature map $C \in \mathbb{R}^{H \times W \times L \times N}$
as follows:

$$\tilde{C}_{ch}(x, y, z, n) = C(x, y, z, n) / \max V_{ch}^{x,y,z}, \tag{6}$$

where $\max V_{ch}^{x,y,z}$ is the maximum value of different
feature channels at pixel position $(x, y, z)$, that is
$\max V_{ch}^{x,y,z} = \max_{n} C(x, y, z, n)$. This channel
normalization ensures that the feature
values of each pixel range in the same interval, and
lets each pixel make an equal contribution to the final
representation.
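The two normalization schemes of Equations (5) and (6) can be sketched in a few lines. In this illustration a feature map is stored as a dictionary from a position (x, y, z) to its list of channel values; this storage layout, the function names, and the guard against a zero maximum are assumptions of the example rather than details from the text.

```python
def spatiotemporal_normalize(fmap):
    """Divide each channel by its maximum over all positions (Equation 5)."""
    num_channels = len(next(iter(fmap.values())))
    channel_max = [max(values[n] for values in fmap.values())
                   for n in range(num_channels)]
    return {
        pos: [v / m if m > 0 else 0.0 for v, m in zip(values, channel_max)]
        for pos, values in fmap.items()
    }


def channel_normalize(fmap):
    """Divide each position's vector by its maximum over channels (Equation 6)."""
    normalized = {}
    for pos, values in fmap.items():
        m = max(values)
        # Activations after RELU are non-negative, so m == 0 means an
        # all-zero vector; it is left as zeros (an assumption).
        normalized[pos] = [v / m if m > 0 else 0.0 for v in values]
    return normalized
```

For example, for a map with values [2.0, 1.0] at one position and [4.0, 0.5] at another, spatiotemporal normalization divides by the per-channel maxima (4.0 and 1.0), while channel normalization divides each position by its own maximum (2.0 and 4.0).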

After the step of feature normalization, we extract
TDDs based on trajectories and normalized convolutional
feature maps by using trajectory pooling. Specifically,
given a trajectory $T_k$ and a normalized feature map $\tilde{C}_m^a$,
which is the $m$-th layer feature map after either spatiotemporal
normalization or channel normalization from the spatial
net or temporal net ($a \in \{s, t\}$), we conduct sum-pooling of
the normalized feature maps over the 3D volume centered
at the trajectory as follows:

$$D(T_k, \tilde{C}_m^a) = \sum_{p=1}^{P} \tilde{C}_m^a\left(\overline{(r_m \times x_p^k)}, \overline{(r_m \times y_p^k)}, z_p^k\right), \tag{7}$$

where $(x_p^k, y_p^k, z_p^k)$ is the $p$-th point position in video
coordinates in trajectory $T_k$, $r_m$ is the $m$-th layer map size ratio
with respect to the input size as listed in Table 1, and $\overline{(\cdot)}$ is the
rounding operation. $D(T_k, \tilde{C}_m^a)$ is called the trajectory-pooled
deep convolutional descriptor, and is a new kind of feature
combining the merits of both improved dense trajectories and
two-stream ConvNets.
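A minimal sketch of the sum-pooling in Equation (7) is given below. It assumes the normalized feature map is stored as a dictionary from map coordinates (x, y, z) to a list of channel values, that every mapped point falls inside the map, and that rounding is round-half-up for non-negative coordinates; these choices and the function name are illustrative assumptions.

```python
def trajectory_pool(trajectory, fmap, ratio):
    """Sum the feature vectors at the mapped trajectory points (Equation 7).

    trajectory: list of (x, y, z) points in video coordinates
    fmap:       dict mapping (x, y, z) map coordinates to channel values
    ratio:      map size ratio r_m of the layer
    """
    num_channels = len(next(iter(fmap.values())))
    descriptor = [0.0] * num_channels
    for x, y, z in trajectory:
        # Scale the spatial coordinates and round; z indexes the frame.
        key = (int(ratio * x + 0.5), int(ratio * y + 0.5), z)
        for n, value in enumerate(fmap[key]):
            descriptor[n] += value
    return descriptor
```

The resulting descriptor has one entry per channel, so its dimension equals the number of channels of the chosen layer (for instance 512 for conv4 in Table 1).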

Multi-scale TDD extension. The above description of
TDD extraction covers a single scale; we now present the
multi-scale extension of TDD. The improved trajectory method
samples and tracks points on multi-scale videos, while
fixing the spatial extent of the HOG, HOF, and MBH descriptors
at 32 x 32. The original method therefore needs to conduct point
tracking and descriptor calculation in multi-scale settings.
In our implementation, we use a more efficient multi-scale
strategy. Specifically, we calculate optical flow and
track points on a single scale. Then we construct multi-scale
pyramid representations of video frames and optical
flow fields. These pyramid representations are fed into
the two-stream ConvNets and transformed into multi-scale
convolutional feature maps as shown in Figure 2. Based on
multi-scale convolutional maps and single-scale improved
trajectories, we are able to compute multi-scale TDDs
efficiently, by applying trajectory pooling to multi-scale
convolutional feature maps as described above. The only
modification for different scales is to replace the feature map
size ratio $r_m$ in Equation (7) with $r_m \times s$, where $s$ is
the scale of the current feature map. In practice, compared
with improved trajectories, we use fewer scales, with $s =
1/2, 1/\sqrt{2}, 1, \sqrt{2}, 2$.
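The change needed for multi-scale pooling is small: the same trajectory point is looked up in each scaled feature map using the ratio multiplied by the scale. The sketch below lists the five scales given above and computes the lookup coordinates for one point; the function name and the round-half-up convention are assumptions of this example.

```python
import math

# The five scales s used for the pyramid, as listed in the text.
SCALES = [0.5, 1 / math.sqrt(2), 1.0, math.sqrt(2), 2.0]


def multiscale_coords(x, y, z, ratio, scales=SCALES):
    """Feature-map coordinates of one video point at every pyramid scale.

    The effective ratio at scale s is ratio * s; z is the frame index.
    Assumes x, y >= 0.
    """
    return [
        (int(ratio * s * x + 0.5), int(ratio * s * y + 0.5), z)
        for s in scales
    ]
```

Because only the lookup ratio changes, point tracking runs once on the original scale and is shared by all pyramid levels.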

5. Experiments 

In this section, we first present the details of datasets and 
their evaluation scheme. Then, we describe the details of 
our method. Finally, we give the experimental results and 
compare TDD with the state of the art. 

5.1. Datasets 

In order to verify the effectiveness of TDDs, we conduct
experiments on two large public datasets, namely HMDB51
[15] and UCF101 [26]. The HMDB51 dataset is a large
collection of realistic videos from various sources, including
movies and web videos. The dataset is composed of
6,766 video clips from 51 action categories, with each
category containing at least 100 clips. Our experiments
follow the original evaluation scheme using three different
training/testing splits. In each split, each action class has
70 clips for training and 30 clips for testing. The average
accuracy over these three splits is used to measure the final
performance.

The UCF101 dataset contains 101 action classes and
there are at least 100 video clips for each class. The whole
dataset contains 13,320 video clips, which are divided
into 25 groups for each action category. We follow the
evaluation scheme of the THUMOS13 challenge [11] and
adopt the three training/testing splits for evaluation. As
UCF101 is larger than HMDB51, we use the UCF101
dataset to train two-stream ConvNets initially, and transfer
this learned model for TDD extraction on the HMDB51
dataset.

5.2. Implementation details 

Two-stream ConvNets training. Training deep ConvNets
is more challenging for action recognition, as actions
are more complex than objects and the available datasets
are extremely small compared with the ImageNet dataset
[6]. We choose the training set of UCF101 split1 for
learning two-stream ConvNets as it is probably the largest
publicly available dataset. We use the Caffe toolbox [10]
for ConvNet implementation. The network weights are
learnt using mini-batch (set to 256) stochastic gradient
descent with momentum (set to 0.9). For the spatial net, we
first resize the frame so that its smaller side is 256, and
then randomly crop a 224 x 224 region from the frame. It
then undergoes random horizontal flipping. We pre-train
the network with the publicly available model [4]. Finally,
we fine-tune the model parameters on the UCF101 dataset,
where the learning rate is set as $10^{-2}$, decreased to $10^{-3}$
after 14K iterations, and training is stopped at 20K iterations.

For the temporal net, the input is a 3D volume of stacked
optical flow fields. We choose the TVL1 optical flow
algorithm [40] and use the OpenCV implementation, due
to its balance between accuracy and efficiency. For fast
computation, we discretize the values of optical flow fields
into integers and set their range as 0-255, just like images.
Specifically, we choose to stack 10 frames of optical
flow fields to keep a balance between performance and
efficiency. We train the temporal net on UCF101 from scratch.
As the dataset is relatively small, we use a high dropout ratio
to improve the generalization capacity of the trained model. We
set dropout to 0.9 for the full6 layer and 0.8 for the full7 layer.
The training procedure of the temporal net is similar to that of the spatial
net, and a 224 x 224 x 20 sub-volume is randomly cropped
and flipped from the training video. The learning rate is initially
set as $10^{-2}$ and decreases to $10^{-3}$ after 50K iterations. It
is then reduced to $10^{-4}$ after 70K iterations and training is
stopped at 90K iterations.
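The text does not spell out how flow values are discretized into the 0-255 range. One common choice, shown here purely as an assumption, is to clip each flow component to a symmetric bound and map it linearly onto the integer range; the bound of 20 pixels and the function name are hypothetical.

```python
def quantize_flow(value, bound=20.0):
    """Map one flow component to an integer in [0, 255].

    Values are clipped to [-bound, bound] and scaled linearly, so that
    -bound maps to 0 and +bound maps to 255. The bound is a
    hypothetical choice, not a value taken from the paper.
    """
    clipped = max(-bound, min(bound, value))
    return int(round(255.0 * (clipped + bound) / (2.0 * bound)))
```

Storing the quantized flow as 8-bit images keeps the temporal-net input pipeline the same as for ordinary frames, at the cost of a small loss in precision.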

Results of two-stream ConvNets. To evaluate the 
trained model, as in [24], we select 25 frames for each 






Figure 3. Exploration of different settings in TDD on the HMDB51
dataset. Left: Performance trend with varying PCA reduced
dimension. Right: Comparison of different normalization methods.
"Combine" means the fusion of spatiotemporal normalization and
channel normalization.


video clip and obtain 10 crops for each frame. The final 
recognition result is the average across these crops and 
frames. We obtain 71.2% recognition accuracy with spatial 
net and 80.1% with temporal net. The performance of 
our implemented two-stream ConvNets is 84.7%, which 
is similar to that of two-stream ConvNets [24] (85.6%). 
However, obtaining ConvNets with high performance is 
not the final goal of this paper, and we aim to verify the 
effectiveness of TDDs. 

Feature encoding. We choose Fisher vector [23] to
encode the TDDs of a video clip into a high-dimensional
representation, as its effectiveness for action recognition has
been verified in previous works [38, 27], and then use a
linear SVM as the classifier (C = 100). In order to train
GMMs, we first de-correlate TDD with PCA and reduce its
dimension to D. Then, we train a GMM with K (K = 256)
mixtures, and finally the video is represented with a
2KD-dimensional vector.
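As a quick check of the resulting representation size, the snippet below evaluates the 2KD formula. The function name is hypothetical; the example value D = 64 is the PCA dimension the paper settles on in Section 5.3.

```python
def fisher_vector_dim(num_mixtures, descriptor_dim):
    """Length of a Fisher vector holding first- and second-order
    statistics for each GMM mixture and each descriptor dimension."""
    return 2 * num_mixtures * descriptor_dim


print(fisher_vector_dim(256, 64))
```

With K = 256 and D = 64, each Fisher vector has 32,768 dimensions.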

5.3. Exploration experiments 

Dimension reduction. To specify the PCA dimension 
of TDD for GMM training and Fisher vector encoding, we 
first explore different dimensions reduced by PCA on the 
HMDB51 dataset, with conv4 descriptors from spatial net. 
In this exploration experiment, we use the spatiotemporal 
normalization method for TDD and the results are shown in 
the left of Figure 3. We vary the dimension from 32 to 256
and the results show that dimension 64 achieves the highest
performance, and higher dimensions may cause performance
degradation. Thus, we fix the dimension as 64 for TDDs in
the remainder of this section.

Normalization method. Another important component
in TDD design is the normalization method and we have
presented two normalization methods: spatiotemporal
normalization (ST. Norm.) and channel normalization (Cha.
Norm.) in Section 4.3. We conduct experiments to
investigate the effectiveness of normalization methods by
using conv4 descriptors from spatial net on the HMDB51
dataset, and the results are shown in the right of Figure
3. We see that normalization is important for improving


Algorithm                   HMDB51   UCF101
HOG [31, 32]                40.2%    72.4%
HOF [31, 32]                48.9%    76.0%
MBH [31, 32]                52.1%    80.8%
HOF+MBH [31, 32]            54.7%    82.2%
iDT [31, 32]                57.2%    84.7%
Spatial net [24]            40.5%    73.0%
Temporal net [24]           54.6%    83.7%
Two-stream ConvNets [24]    59.4%    88.0%
Spatial conv4               48.5%    81.9%
Spatial conv5               47.2%    80.9%
Spatial conv4 and conv5     50.0%    82.8%
Temporal conv3              54.5%    81.7%
Temporal conv4              51.2%    80.1%
Temporal conv3 and conv4    54.9%    82.2%
TDD                         63.2%    90.3%
TDD and iDT                 65.9%    91.5%

Table 3. Performance of TDD on the HMDB51 dataset and
UCF101 dataset. We compare our proposed TDD with iDT
features [31] and two-stream ConvNets [24]. We also explore the
complementary properties of TDD features and iDT features. The
combination of them can further boost the performance.

performance and spatiotemporal normalization is the best 
choice. We also explore the complementary property 
of these two normalization methods by fusing the Fisher 
vectors of them, and observe that it can further improve the 
performance. Therefore, in the remainder of this section, 
we will use the combined representation obtained from 
these two normalization methods for TDDs. 
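
As an illustration of the two normalization methods compared above, the following is a minimal sketch. Section 4.3 lies outside this excerpt, so the definitions used here are assumptions: spatiotemporal normalization divides each channel by its maximum over the whole video volume, and channel normalization divides each position's feature vector by its maximum over channels. The indexing C[t][y][x][n] and the function names are likewise hypothetical, and the feature maps are assumed to be non-negative.

```python
def spatiotemporal_normalize(C):
    """Divide each channel by its maximum over all frames and positions.

    C is indexed as C[t][y][x][n] (assumed layout) with non-negative values.
    """
    n_channels = len(C[0][0][0])
    # Maximum response of each channel over the whole video volume.
    channel_max = [
        max(pixel[n] for frame in C for row in frame for pixel in row)
        for n in range(n_channels)
    ]
    return [
        [[[v / channel_max[n] if channel_max[n] > 0 else 0.0
           for n, v in enumerate(pixel)]
          for pixel in row]
         for row in frame]
        for frame in C
    ]


def channel_normalize(C):
    """Divide the feature vector at each position by its maximum over channels."""
    def scale(pixel):
        peak = max(pixel)
        return [v / peak if peak > 0 else 0.0 for v in pixel]

    return [[[scale(pixel) for pixel in row] for row in frame] for frame in C]


# Toy feature maps: 1 frame, 1 row, 2 positions, 2 channels.
C = [[[[2.0, 1.0], [4.0, 0.5]]]]
st = spatiotemporal_normalize(C)
ch = channel_normalize(C)
print(st)  # [[[[0.5, 1.0], [1.0, 0.5]]]]
print(ch)  # [[[[1.0, 0.5], [1.0, 0.125]]]]
```

Under these assumptions, the first variant puts all channels on the same range across the video, while the second equalizes the response strength across positions; the two produce different descriptors from the same feature maps, which is consistent with their Fisher vectors being complementary.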

Different layers. Finally, we investigate the performance
of TDDs from different layers of the spatial and temporal nets
on the HMDB51 dataset, and the results are summarized in
Table 2. For layers conv5, conv4, and conv3, we use the
outputs of the ReLU activations, and for layers conv2 and
conv1, we choose the outputs of the max pooling layers after
the convolution operations. We see that the descriptors of layers
conv4 and conv5 obtain the highest recognition performance
for the spatial net, while those of layers conv3 and conv4 are
the top performers for the temporal net. Therefore, in the following
evaluation of TDD, we choose the descriptors from the conv4
and conv5 layers for the spatial net, and the conv3 and conv4
layers for the temporal net.
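
The layer choice can be checked directly against the HMDB51 accuracies reported in Table 2. The short sketch below ranks the layers of each net and keeps the two best; the helper name `top_layers` is hypothetical.

```python
# Recognition accuracy (%) per layer on HMDB51, as reported in Table 2.
table2 = {
    "spatial": {"conv1": 24.1, "conv2": 33.9, "conv3": 41.9, "conv4": 48.5, "conv5": 47.2},
    "temporal": {"conv1": 39.2, "conv2": 50.7, "conv3": 54.5, "conv4": 51.2, "conv5": 46.1},
}


def top_layers(accuracies, k=2):
    """Return the names of the k most accurate layers, in layer order."""
    ranked = sorted(accuracies, key=accuracies.get, reverse=True)
    return sorted(ranked[:k])


selected = {net: top_layers(acc) for net, acc in table2.items()}
print(selected)
# {'spatial': ['conv4', 'conv5'], 'temporal': ['conv3', 'conv4']}
```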

                       Spatial ConvNets                     Temporal ConvNets
Convolutional layer    conv1  conv2  conv3  conv4  conv5    conv1  conv2  conv3  conv4  conv5
Recognition accuracy   24.1%  33.9%  41.9%  48.5%  47.2%    39.2%  50.7%  54.5%  51.2%  46.1%

Table 2. The performance of different layers of spatial nets and temporal nets on the HMDB51 dataset.

5.4. Evaluation of TDDs

In this section, we evaluate the performance of our
proposed TDDs on the HMDB51 and UCF101 datasets, and
the experimental results are summarized in Table 3. We first
compare the performance of TDDs with that of improved
trajectories. The convolutional descriptors of the spatial net
are much better than HOG descriptors, which indicates that
deep-learned features contain more discriminative capacity
than hand-crafted features. The convolutional descriptors
of the temporal net are better than or comparable to the
HOF and MBH descriptors, but the improvement is not
as evident as for the spatial convolutional descriptors. The reason
may be that the HOF and MBH calculation is based on warped
optical flow instead of original optical flow, which has been
shown to be quite effective for the HOF descriptor [31]. We
will consider using warped flow for TDD extraction in the
future.

[Figure 4 panels: (a) RGB (b) Flow-x (c) Flow-y (d) S-conv4 (e) S-conv5 (f) T-conv3 (g) T-conv4]

Figure 4. Examples of video frames, optical flow fields, and their corresponding feature maps of spatial nets and temporal nets.

We also compare the performance of TDDs with the
two-stream ConvNets [24]. Although our trained two-stream
ConvNets obtain slightly lower performance than theirs,
we see that our spatial TDDs outperform their spatial net by
a large margin and our temporal TDDs are comparable to their
temporal net. These results indicate that trajectory-constrained
sampling and pooling is an effective strategy
for improving recognition performance, in particular for
spatial TDDs. We also notice that the combined TDDs from
the spatial and temporal nets outperform two-stream ConvNets
by around 4% and 2% on HMDB51 and UCF101, respectively.
Figure 4 shows some examples of video frames, optical flow
fields, and their corresponding feature maps.
From these examples, we see that the convolutional feature
maps are relatively sparse and exhibit high correlation with
the action areas.

Finally, we explore a practical way to improve the
recognition performance of an action recognition system by
combining TDDs with iDTs, using early fusion of their Fisher
vector representations. The recognition results are shown
in Table 3, and the fusion further boosts the
performance. This improvement indicates that our TDDs
are complementary to those low-level local features.
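
Early fusion here amounts to combining the per-video Fisher vectors of the different descriptors into one vector before classification. A minimal sketch follows, assuming the fusion is a plain concatenation and that each Fisher vector is L2-normalized first; the text does not specify the normalization, and the function names and toy values are hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit Euclidean norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def early_fusion(*fisher_vectors):
    """Concatenate L2-normalized Fisher vectors into a single video representation."""
    fused = []
    for fv in fisher_vectors:
        fused.extend(l2_normalize(fv))
    return fused


# Toy Fisher vectors standing in for the TDD and iDT encodings of one video.
tdd_fv = [3.0, 4.0]
idt_fv = [0.0, 5.0, 12.0]
video_repr = early_fusion(tdd_fv, idt_fv)
print(len(video_repr))  # 5
```

The fused vector is then the input to the linear SVM, so a single classifier sees both feature types at once.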

Computational costs. Compared with iDT, we only
track points on a single scale and extract original flow
instead of warped flow. The ConvNets are implemented in
CUDA, and the computation is very efficient.

HMDB51                          UCF101
STIP+BoVW [15]      23.0%       STIP+BoVW [26]      43.9%
Motionlets [35]     42.1%       Deep Net [12]       63.3%
DT+BoVW [30]        46.6%       DT+VLAD [3]         79.9%
DT+MVSV [3]         55.9%       DT+MVSV [3]         83.5%
iDT+FV [31]         57.2%       iDT+FV [32]         85.9%
iDT+HSV [21]        61.1%       iDT+HSV [21]        87.9%
Two Stream [24]     59.4%       Two Stream [24]     88.0%
TDD+FV              63.2%       TDD+FV              90.3%
Our best result     65.9%       Our best result     91.5%

Table 4. Comparison of TDD to the state of the art. We separately
present the results of TDDs and our best results obtained with early
fusion of TDDs and iDTs.

5.5. Comparison to the state of the art 

Table 4 compares our recognition results with several
recently published methods on the HMDB51 and
UCF101 datasets. TDDs outperform previous
methods on both datasets. On the HMDB51 dataset, our
best result outperforms other methods by 4.8%, and on the
UCF101 dataset, our best result outperforms them by 3.5%. This
superior performance of TDDs indicates the effectiveness
of introducing trajectory-constrained sampling and pooling
into deep-learned features.
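
The stated margins follow from the entries of Table 4. The sketch below recomputes them by comparing our results with the best previously published number on each dataset; the helper name `margin_over_prior` is hypothetical.

```python
# Accuracies (%) from Table 4.
table4 = {
    "HMDB51": {
        "STIP+BoVW [15]": 23.0, "Motionlets [35]": 42.1, "DT+BoVW [30]": 46.6,
        "DT+MVSV [3]": 55.9, "iDT+FV [31]": 57.2, "iDT+HSV [21]": 61.1,
        "Two Stream [24]": 59.4, "TDD+FV": 63.2, "Our best result": 65.9,
    },
    "UCF101": {
        "STIP+BoVW [26]": 43.9, "Deep Net [12]": 63.3, "DT+VLAD [3]": 79.9,
        "DT+MVSV [3]": 83.5, "iDT+FV [32]": 85.9, "iDT+HSV [21]": 87.9,
        "Two Stream [24]": 88.0, "TDD+FV": 90.3, "Our best result": 91.5,
    },
}
OURS = ("TDD+FV", "Our best result")


def margin_over_prior(results, ours):
    """Difference between one of our results and the best previously published one."""
    best_prior = max(acc for name, acc in results.items() if name not in OURS)
    return round(results[ours] - best_prior, 1)


for dataset, results in table4.items():
    print(dataset,
          margin_over_prior(results, "TDD+FV"),
          margin_over_prior(results, "Our best result"))
# HMDB51 2.1 4.8
# UCF101 2.3 3.5
```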

6. Conclusions 

This paper has proposed an effective video representation,
called trajectory-pooled deep-convolutional descriptor
(TDD), which integrates the advantages of hand-crafted and
deep-learned features. Deep architectures are utilized to
learn discriminative convolutional feature maps, and then
the strategies of trajectory-constrained sampling and pooling
are adopted to aggregate these convolutional features
into TDDs. Our features achieve superior performance
on two datasets for action recognition, as evidenced by
comparison with the state-of-the-art methods.